Abstract. In recent years, distinctive-dictionary construction
has gained importance due to its usefulness in data processing.
Usually, one or more dictionaries are constructed from
training data and then used to classify signals that
did not participate in the training process. A new dictionary
construction algorithm is introduced. It is based on a low-rank
matrix factorization achieved by applying the
randomized LU decomposition to the training data. This method is
fast, scalable, parallelizable, consumes low memory, outperforms
SVD in these categories and also works extremely well on large
sparse matrices. In contrast to existing methods, the randomized
LU decomposition constructs an under-complete dictionary, 
which simplifies both the construction and the classification 
processes of newly arrived signals. The dictionary construction
is generic and fits different applications. We
demonstrate the capabilities of this algorithm for file type
identification, which is a fundamental task in the digital security
arena, performed nowadays, for example, by sandboxing
mechanisms, deep packet inspection, firewalls and anti-virus
systems. We propose a content-based method that detects file
types and depends neither on the file extension nor on metadata.
Such an approach is harder to deceive, and we show that only a
few file fragments from a whole file are needed for a successful
classification. Based on the constructed dictionaries, we show
that the proposed method can effectively identify executable code
fragments in PDF files.

Keywords. Dictionary construction, classification, LU decomposition,
randomized LU decomposition, content-based file detection,
computer security.


I. Introduction 

Recent years have shown a growing interest in dictionary 
learning. Dictionaries were found to be useful for applications 
such as signal reconstruction, denoising, image inpainting,
compression, sparse representation, classification and more. 
Given a data matrix $A$, a dictionary learning algorithm produces
two matrices $D$ and $X$ such that $\|A - DX\|$ is small,
where $D$ is called the dictionary and $X$ is the coefficient matrix,
also called the representation matrix. Sparsity of $X$ means that
each signal from $A$ is described with only a few signals (also
called atoms) from the dictionary $D$. It is a major property
pursued by many dictionary learning algorithms. The
algorithms that learn dictionaries for sparse representations
optimize a goal function $\min_{D,X} \|A - DX\| + \lambda\|X\|_0$,
which considers both the accuracy and the sparsity of the
solution, by optimizing these two properties alternately ($\lambda$ is
a regularization term). This construction is computationally
expensive and does not scale well to big data. It becomes
even worse when dictionary learning is used for classification,
since another distinctive term, in addition to the two aforementioned,
is introduced in the objective function. This
term provides the learned dictionary with a discriminative ability.
This can be seen, for example, in the optimization problem
$\min_{D,X,W} \|A - DX\| + \lambda\|X\|_0 + \xi\|H - WX\|$, where $W$
is a classifier and $H$ is a vector of labels. $\|H - WX\|$ is the
penalty term for achieving a wrong classification. In order to
achieve the described properties, dictionaries are usually over-complete,
namely, they contain more atoms than the signal
dimension. As a consequence, dictionaries are redundant such
that there are linear dependencies between atoms. Therefore,
a given signal can be represented in more than one way using
dictionary atoms. This enables us on one hand to get sparse
representations, but on the other hand it complicates the representation
process because it is NP-hard to find the sparsest
representation for a signal by an over-complete dictionary [?].

In this work, we provide a generic way to construct an
under-complete dictionary. Its capabilities will be demonstrated
for a signal classification task. Since we do not look
for a sparse signal representation, we remove the alternating
optimization process used in the construction of over-complete
dictionaries. Our dictionary construction is based on matrix
factorization. We use the randomized LU matrix factorization
algorithm [?] for the dictionary construction. This algorithm,
which is applied to a given data matrix $A \in \mathbb{R}^{m \times n}$ of $m$
features and $n$ data-points, decomposes $A$ into two matrices
$L$ and $U$, where $L$ is the dictionary and $U$ is the coefficient
matrix. The size of $L$ is determined by the decaying spectrum
of the singular values of the matrix $A$, and is bounded by
$\min\{n,m\}$. The columns of $L$ and the rows of $U$ are linearly
independent. The proposed dictionary construction has several advantages: it
is fast, scalable, parallelizable and thus can run on GPU and
multicore-based systems, consumes low memory, outperforms
SVD in these categories and works extremely well on large
sparse matrices. Under this construction, the classification of a
newly arrived signal is done by a fast projection method that
represents this signal by the columns of the matrix $L$. The
computational cost of this method is linear in the input size,
while in the over-complete case finding the optimal solution
is NP-hard [?]. Approximation algorithms for sparse signal
reconstruction, like Orthogonal Matching Pursuit [?] or Basis
Pursuit [?], have no guarantees for general dictionaries.


In order to evaluate the performance of the dictionaries, 
which are constructed by the application of the randomized 
LU algorithm to a training set, we use them to classify file 
types. The experiments were conducted on a dataset that 
contains files of various types. The goal is to classify each 
file or portion of a file to the class describing its type. To 
the best of our knowledge, this work is the first to use a
dictionary learning method for file type classification. This
work considers three different scenarios that represent real
security tasks: examining the full content of the tested files,
classifying a file type using a small number of fragments from
the file, and detecting malicious code hidden inside innocent-looking
files. While the first two scenarios were examined by
other works, none of the papers described in this work dealt
with the latter scenario. It is difficult to compare our results
to other algorithms since the datasets they used are not publicly
available. For similar testing conditions, we improve the
state-of-the-art results. The datasets we used will be made publicly
available.

The paper has the following structure: Section II reviews
related work on dictionary construction and on file content
recognition algorithms. Section III presents the dictionary
construction algorithm. Section IV shows how to utilize it to
develop our classification algorithms for file content detection.
Section V addresses the problem of computing the correct dictionary
sizes needed by the classifiers. Experimental results
are presented in Section VI and compared with other content
classification methods.

II. Related Work 

Dictionary-based classification models have been the focus
of much recent research, leading to results in face recognition [?],
digit recognition [?], object categorization [?] and more.
Many of these works [?] utilize K-SVD [?] for their training or, in
other words, for their dictionary learning step. Others define
different objective functions such as the Fisher Discriminative
Dictionary Learning [?]. The majority of these methods use an
alternating optimization process in order to construct their dictionary.
This optimization procedure seeks a dictionary which
is re-constructive, enables sparse representation and sometimes
also has a discriminative property. In some works (see for
example [?]) the dictionary learning algorithm requires
meta parameters to regulate these properties of the learned
dictionary. Finding the optimal values for these parameters
is a challenging task that adds complexity to the proposed
solutions. A dictionary construction that uses a multivariate
optimization process is a computationally expensive task (as
described in [?], for example). The approach proposed in
this paper avoids these complexities by using the
randomized LU algorithm [?]. The dictionary it creates is
under-complete, where the number of atoms is smaller than
the signal dimension. The outcome is that the dictionary
construction is fast without compromising its ability
to achieve high classification accuracy. We improve upon
the state-of-the-art results in file type classification [?], as
demonstrated by the experimental results.


The testing phase in many dictionary learning schemes is
simple. Usually, a linear classifier is used to assign test signals
to one of the learned classes [?]. However, classifier
learning combined with dictionary learning adds
overhead to the process [?]. The method proposed in
this paper does not require special attention to
classifier learning. We utilize the output from the randomized
LU algorithm to create a projection matrix. This matrix is
used to measure the distance between a test signal and the
dictionary. The signal is then classified as belonging to the
class that approximates it best. The classification process is
fast and simple. The results described in Section VI show high
accuracy in the content-based file type classification task.

We used this classification task to test the randomized
LU dictionary construction and to measure its discriminative
power. This task is useful in computer security applications
like anti-virus systems and firewalls that need to detect files
transmitted through a network and respond quickly to threats.
Previous works in this field mainly use deep packet inspection
(DPI) and byte frequency distribution features (1-gram statistics)
in order to analyze a file [?]. In
some works, other features were tested, like consecutive byte
differences [?] and statistical properties of the content [?].
The randomized LU decomposition [?] construction is capable
of dealing with a large number of features. This enables
us to test our method on high dimensional feature sets like
double-byte frequency distributions (2-gram statistics) where
each measurement has 65536 Markov-walk based features. We
refer the reader to [?] and references within for an exhaustive
comparison of the existing methods for content-based file type
classification.
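
To make the size of the 2-gram feature set concrete, the following minimal sketch counts consecutive byte pairs into a 256 x 256 = 65536-bin distribution. The function name `bigram_distribution` and the use of NumPy are illustrative assumptions; the sketch computes plain pair frequencies and does not claim to reproduce the exact Markov-walk normalization used in the experiments.

```python
import numpy as np

def bigram_distribution(data: bytes) -> np.ndarray:
    """Frequency of every consecutive byte pair (hypothetical helper).

    Returns a vector of length 65536 whose entry 256*a + b is the
    fraction of adjacent positions holding byte a followed by byte b.
    """
    b = np.frombuffer(data, dtype=np.uint8).astype(np.int64)
    pair_index = b[:-1] * 256 + b[1:]          # one index per adjacent pair
    counts = np.bincount(pair_index, minlength=65536)
    return counts / max(len(data) - 1, 1)      # normalize by number of pairs
```

For the fragment `b"ABAB"` there are three pairs (AB, BA, AB), so the entry for "AB" holds 2/3 and the entry for "BA" holds 1/3.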

Throughout this work, when $A$ is a matrix, the norm $\|A\|$
indicates the spectral norm (the largest singular value of $A$)
and when $A$ is a vector it indicates the standard $\ell_2$ norm
(Euclidean norm).
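
As a small illustration of this convention, the sketch below evaluates both norms with NumPy (an assumed tool, not one named in the paper). `np.linalg.norm(A, 2)` returns the largest singular value for a 2-D array, and `np.linalg.norm(v)` returns the Euclidean norm for a 1-D array.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])
v = np.array([3.0, 4.0])

spectral = np.linalg.norm(A, 2)                        # spectral norm of a matrix
largest_sv = np.linalg.svd(A, compute_uv=False)[0]     # largest singular value
euclidean = np.linalg.norm(v)                          # l2 norm of a vector
```

Here `spectral` and `largest_sv` are both 4, and `euclidean` is 5.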


III. Randomized LU 


In this section, we present the randomized LU decomposition
algorithm for computing the rank $k$ LU approximation of
a full matrix (Algorithm III.1). The main building blocks of
the algorithm are random projections and Rank Revealing LU
(RRLU) [?], which are used to obtain a stable low-rank approximation for an
input matrix $A$.

The RRLU algorithm, used in the randomized LU algorithm,
reveals the connection between the LU decomposition of a
matrix and its singular values. This property is very important
since it connects the size of the decomposition to the
actual numerical rank of the data. Similar algorithms exist for
rank revealing QR decompositions (see, for example, [?]).

Theorem III.1 ([?]). Let $A$ be an $m \times n$ matrix ($m \ge n$).
Given an integer $1 \le k < n$, then the following factorization

$$PAQ = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{n-k} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \qquad \text{(III.1)}$$

holds where $L_{11}$ is a lower triangular with ones on the
diagonal, $U_{11}$ is an upper triangular, $P$ and $Q$ are orthogonal
permutation matrices. Let $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$ be the
singular values of $A$, then:

$$\sigma_k \ge \sigma_{\min}(L_{11}U_{11}) \ge \frac{\sigma_k}{k(n-k)+1}$$

and

$$\sigma_{k+1} \le \|U_{22}\| \le (k(n-k)+1)\,\sigma_{k+1}.$$

Based on Theorem III.1 we have the following definition:

Definition III.1 (RRLU Rank $k$ Approximation denoted
RRLU$_k$). Given a RRLU decomposition (Theorem III.1) of
a matrix $A$ with an integer $k$ (as in Eq. (III.1)) such that
$PAQ = LU$, then the RRLU rank $k$ approximation is defined
by taking $k$ columns from $L$ and $k$ rows from $U$ such that

$$\mathrm{RRLU}_k(PAQ) = \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix},$$

where $L_{11}, L_{21}, U_{11}, U_{12}, P$ and $Q$ are defined in Theorem III.1.

Lemma III.2 ([?] RRLU Approximation Error). The error
of the RRLU$_k$ approximation of $A$ is

$$\|PAQ - \mathrm{RRLU}_k(PAQ)\| \le (k(n-k)+1)\,\sigma_{k+1}.$$

Algorithm III.1 describes the flow of the randomized LU decomposition
algorithm.


Algorithm III.1: Randomized LU Decomposition
Input: $A$ matrix of size $m \times n$ to decompose; $k$ rank of
$A$; $l$ number of columns to use (for example, $l = k + 5$).
Output: Matrices $P, Q, L, U$ such that
$\|PAQ - LU\| \le O(\sigma_{k+1}(A))$ where $P$ and $Q$ are
orthogonal permutation matrices, $L$ and $U$ are the lower
and upper triangular matrices, respectively, and $\sigma_{k+1}(A)$
is the $(k+1)$th singular value of $A$.

1: Create a matrix $G$ of size $n \times l$ whose entries are
   i.i.d. Gaussian random variables with zero mean
   and unit standard deviation.

2: $Y \leftarrow AG$.

3: Apply RRLU decomposition (see [?]) to $Y$
   such that $PYQ_y = L_y U_y$.

4: Truncate $L_y$ and $U_y$ by choosing the first $k$ columns
   and $k$ rows, respectively: $L_y \leftarrow L_y(:, 1:k)$ and
   $U_y \leftarrow U_y(1:k, :)$.

5: $B \leftarrow L_y^{\dagger} P A$. ($L_y^{\dagger}$ is the pseudo-inverse of $L_y$.)

6: Apply LU decomposition to $B$ with column pivoting:
   $BQ = L_b U_b$.

7: $L \leftarrow L_y L_b$.

8: $U \leftarrow U_b$.


Remark III.3. In most cases, it is sufficient to compute the 
regular LU decomposition in Step 3 instead of computing the 
RRLU decomposition. 
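
The steps of Algorithm III.1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: following Remark III.3 it uses a plain row-pivoted LU in Step 3 (so $Q_y$ is the identity), the helper names `lu_partial_pivot` and `randomized_lu` are hypothetical, and permutations are returned as index arrays rather than matrices.

```python
import numpy as np

def lu_partial_pivot(M):
    """Row-pivoted LU of a tall, square or wide matrix: M[perm] = L @ U."""
    M = np.array(M, dtype=float)
    m, n = M.shape
    r = min(m, n)
    perm = np.arange(m)
    for j in range(r):
        p = j + np.argmax(np.abs(M[j:, j]))       # pivot row for column j
        if p != j:
            M[[j, p]] = M[[p, j]]
            perm[[j, p]] = perm[[p, j]]
        if M[j, j] != 0:
            M[j + 1:, j] /= M[j, j]               # store multipliers in place
            M[j + 1:, j + 1:] -= np.outer(M[j + 1:, j], M[j, j + 1:])
    L = np.tril(M[:, :r], -1) + np.eye(m, r)      # unit lower trapezoidal
    U = np.triu(M[:r, :])                         # upper trapezoidal
    return perm, L, U

def randomized_lu(A, k, l, seed=0):
    """Sketch of Algorithm III.1; returns row_perm, col_perm, L, U with
    A[row_perm][:, col_perm] approximately equal to L @ U."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((n, l))               # Step 1
    Y = A @ G                                     # Step 2
    row_perm, Ly, _ = lu_partial_pivot(Y)         # Step 3 (plain LU, Remark III.3)
    Ly = Ly[:, :k]                                # Step 4
    B = np.linalg.pinv(Ly) @ A[row_perm]          # Step 5
    # Step 6: column pivoting on B is row pivoting on B^T;
    # B^T[col_perm] = Lt @ Ut implies B[:, col_perm] = Ut^T @ Lt^T.
    col_perm, Lt, Ut = lu_partial_pivot(B.T)
    Lb, Ub = Ut.T, Lt.T
    return row_perm, col_perm, Ly @ Lb, Ub        # Steps 7 and 8
```

For a matrix of exact rank $k$, calling `randomized_lu(A, k, k + 5)` should reproduce the permuted matrix up to rounding error; for matrices with a decaying spectrum the error is governed by $\sigma_{k+1}(A)$.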

The running time complexity of Algorithm III.1 is
$O(mn(l + k) + l^2 m + k^3 + k^2 n)$ (see Section 4.1 in [?]
for a detailed analysis). It is shown in Section 4.2 in [?] that
the error bound of Algorithm III.1 is given by the following
theorem:

Theorem III.4 ([?]). Given a matrix $A$ of size $m \times n$. Then,
its randomized LU decomposition produced by Algorithm III.1
with integers $k$ and $l$ ($l \ge k$) satisfies

$$\|LU - PAQ\| \le \left(2\sqrt{2nl\beta^2\gamma^2 + 1} + 2\sqrt{2nl}\,\beta\gamma\,(k(n-k)+1)\right)\sigma_{k+1}(A)$$

with probability not less than

$$1 - \frac{1}{\sqrt{2\pi(l-k+1)}}\left(\frac{e}{(l-k+1)\beta}\right)^{l-k+1} - \frac{1}{4(\gamma^2-1)\sqrt{\pi n\gamma^2}}\left(\frac{2\gamma^2}{e^{\gamma^2-1}}\right)^{n}$$

for all $\beta > 0$ and $\gamma > 1$.

IV. Randomized LU Based Classification Algorithm

This section describes the application of the randomized
LU Algorithm III.1 to a classification task. The training phase
includes dictionary construction for each learned class from a
given dataset. The classification phase assigns a newly arrived
signal to one of the classes based on its similarity to the
learned dictionaries. Let $X \in \mathbb{R}^{m \times n}$ be the matrix whose
$n$ columns are the training signals (samples). Each column
is defined by $m$ features. Based on Section III, we apply
the randomized LU decomposition (Algorithm III.1) to $X$,
yielding $PXQ \approx LU$. The outputs $P$ and $Q$ are orthogonal
permutation matrices. Theorem IV.1 shows that $P^T L$ forms
(up to a certain accuracy) a basis to $A$. This is the key property
of the classification algorithm.

Theorem IV.1. Given a matrix $A$. Its randomized LU decomposition
is $PAQ \approx LU$. Then, the error of representing $A$ by
$P^T L$ satisfies:

$$\|(P^T L)(P^T L)^{\dagger} A - A\| \le \left(2\sqrt{2nl\beta^2\gamma^2 + 1} + 2\sqrt{2nl}\,\beta\gamma\,(k(n-k)+1)\right)\sigma_{k+1}(A)$$

with the same probability as in Theorem III.4.

Proof: By combining Theorem III.4 with the fact that
$BQ = L_b U_b = L_y^{\dagger} PAQ$ we get

$$\|LU - PAQ\| = \|L_y L_b U_b - PAQ\| = \|L_y L_y^{\dagger} PAQ - PAQ\|.$$

Then, by using the fact that $L_b$ is square and invertible we get

$$\|L_y L_y^{\dagger} PAQ - PAQ\| = \|L_y L_b L_b^{-1} L_y^{\dagger} PAQ - PAQ\| = \|L L^{\dagger} PAQ - PAQ\|.$$

By using the fact that the spectral norm is invariant to
orthogonal transformations, we get

$$\|L L^{\dagger} PAQ - PAQ\| = \|L L^{\dagger} PA - PA\| = \|P^T L L^{\dagger} PA - A\| = \|(P^T L)(P^T L)^{\dagger} A - A\|$$
$$\le \left(2\sqrt{2nl\beta^2\gamma^2 + 1} + 2\sqrt{2nl}\,\beta\gamma\,(k(n-k)+1)\right)\sigma_{k+1}(A),$$

with the same probability as in Theorem III.4. ∎

Assume that our dataset is composed of the sets
$X_1, X_2, \ldots, X_r$. We denote by $D_i = P_i^T L_i$ the dictionary
learned from the set $X_i$ by Algorithm III.1. $U_i Q_i^T$ is the
corresponding coefficient matrix. It is used to reconstruct
signals from $X_i$ as a linear combination of atoms from $D_i$. The
training phase of the algorithm is done by the application of
Algorithm III.1 to different training datasets that correspond
to different classes. For each class, a different dictionary is
learned. The size of $D_i$, namely its number of atoms, is
determined by the parameter $k_i$ that is related to the decaying
spectrum of the matrix $X_i$. The dictionaries do not have to be
of equal sizes. A discussion about the dictionary sizes appears
later in this section and in Section V. The third parameter,
which Algorithm III.1 needs, is the number of projections $l$ on
the random matrix columns. $l$ is related to the error bound in
Theorem III.4 and it is used to ensure a high success probability
for Algorithm III.1. Taking $l$ to be a little bigger than $k$ is
sufficient. The training process of our algorithm is described
in Algorithm IV.1.


Algorithm IV.1: Dictionaries Training using Randomized LU

Input: $X = \{X_1, X_2, \ldots, X_r\}$ training datasets for $r$
sets; $K = \{k_1, k_2, \ldots, k_r\}$ dictionary size of each set.
Output: $D = \{D_1, D_2, \ldots, D_r\}$ set of dictionaries.

1: for $t \in \{1, 2, \ldots, r\}$ do
     $P_t, Q_t, L_t, U_t \leftarrow$ Randomized LU Decomposition($X_t, k_t, l$),
       ($l = k_t + 5$); (Algorithm III.1)
     $D_t \leftarrow P_t^T L_t$

2: $D \leftarrow \{D_1, D_2, \ldots, D_r\}$


For the test phase of the algorithm, we need a similarity 
measure that provides a distance between a given test signal 
and a dictionary. 

Definition IV.1. Let $x$ be a signal and $D$ be a dictionary. The
distance between $x$ and the dictionary $D$ is defined by

$$dist(x, D) = \|D D^{\dagger} x - x\|,$$

where $D^{\dagger}$ is the pseudo-inverse of the matrix $D$.

The geometric meaning of $dist(x, D_i)$ is related to the
projection of $x$ onto the column space of $D_i$, where $D_i$ is
the dictionary learned for class $i$ of the problem. $dist(x, D_i)$
denotes the distance between $x$ and $D_i D_i^{\dagger} x$, which is the vector
$x$ rebuilt with the dictionary $D_i$. If $x \in$ column-span$\{D_i\}$
then Theorem IV.1 guarantees that $dist(x, D_i) \le \epsilon$. For
$x \notin$ column-span$\{D_i\}$, $dist(x, D_i)$ is large. Thus, $dist$ is
used for classification as described in Algorithm IV.2.
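
The distance of Definition IV.1 and the minimal-distance rule built on it can be sketched as follows. This is an illustrative NumPy sketch; the function names `dist` and `classify` and the use of `np.linalg.pinv` for $D^{\dagger}$ are assumptions made for the example.

```python
import numpy as np

def dist(x, D):
    """dist(x, D) = ||D D^+ x - x||, the residual of projecting x
    onto the column space of the dictionary D."""
    return np.linalg.norm(D @ (np.linalg.pinv(D) @ x) - x)

def classify(x, dictionaries):
    """Return the index of the dictionary with the smallest distance to x."""
    errors = [dist(x, D) for D in dictionaries]
    return int(np.argmin(errors))

# Two toy dictionaries in R^3: span{e1, e2} and span{e3}.
D1 = np.eye(3)[:, :2]
D2 = np.eye(3)[:, 2:]
x = np.array([1.0, 1.0, 0.1])
label = classify(x, [D1, D2])
```

In this toy case `dist(x, D1)` is 0.1 and `dist(x, D2)` is the square root of 2, so `label` is 0. A signal orthogonal to a dictionary, such as `[0, 0, 2]` against `D1`, has a distance equal to its own norm.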

The core of Algorithm IV.2 is the $dist$ function from
Definition IV.1, which examines the portion of the
signal that is spanned by the dictionary atoms. If the signal
can be expressed with high accuracy as a linear combination
of the dictionary atoms then their $dist$ will be small. The
best accuracy is achieved when the examined signal belongs

Algorithm IV.2: Dictionary based Classification

Input: $x$ input test signal; $D = \{D_1, D_2, \ldots, D_r\}$ set of
dictionaries.

Output: $t_x$ the classified class label for $x$.

1: for $t \in \{1, 2, \ldots, r\}$ do
     $ERR_t \leftarrow dist(x, D_t)$

2: $t_x \leftarrow \arg\min_t \{ERR_t\}$


to the span of $D_i$. In this case, $dist$ is small and bounded
by Theorem III.4. On the other hand, if the dictionary atoms
cannot express a signal well then their $dist$ will be large. The
largest distance is achieved when a signal is orthogonal to
the dictionary atoms. In this case, $dist$ will be equal to the
norm of the signal. Signal classification is accomplished by
finding a dictionary with a minimal distance to it. This is
where the dictionary size comes into play. The more atoms
a dictionary has, the larger the space of signals that have
low $dist$ to it, and vice versa. By adding or removing atoms
from a dictionary, the distances between this dictionary and the
test signals are changed. This affects the classification results
of Algorithm IV.2. The practical meaning of this observation
is that dictionary sizes need to be chosen carefully. Ideally,
we wish that each dictionary will be of $dist$ zero to test
signals of its type, and of large $dist$ values for signals of other
types. However, in reality, some test signals are represented
more accurately by a dictionary of the wrong type than by
a dictionary of their class type. For example, we encountered
several cases where GIF files were represented more accurately
by a PDF dictionary than by a GIF dictionary. An incorrect
selection of the dictionary size, $k$, will result in either a
dictionary that cannot represent signals of its own class well
(causing misdetections), or in a dictionary that represents
signals from other classes too accurately (causing false alarms).
The first problem occurs when the dictionary is too small
whereas the second occurs when the dictionary is too large.
In Section V we discuss the problem of finding the optimal
dictionary sizes and how they relate to the spectrum of the
training data matrices.

V. Determining the Dictionaries Sizes 

One possible way to find the dictionary sizes is to observe
the spectrum decay of the training data matrix. In this
approach, the number of atoms in each dictionary is selected as
the number of singular values that capture most of the energy
of the training matrix. This method is based on estimating
the numerical rank of the matrix, namely on the dimension
of its column space. Such a dictionary approximates well the
column space of the data and represents accurately signals of
its own class. Nevertheless, it is possible in this construction
that the dictionary of a certain class will have a high rate of false
alarms. In other words, this dictionary might approximate
signals from other classes with a low error rate.
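
The spectrum-based choice of a dictionary size can be sketched as below. The text does not fix what "most of the energy" means, so the threshold `energy_fraction` (0.95 by default) and the function name `size_from_energy` are assumptions made only for illustration.

```python
import numpy as np

def size_from_energy(X, energy_fraction=0.95):
    """Smallest k whose k leading singular values hold at least
    `energy_fraction` of the total squared singular-value energy."""
    s = np.linalg.svd(X, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy_fraction)) + 1
    return min(k, len(s))

# Toy matrix with singular values 3, 2, 1 and 0.01.
X = np.diag([3.0, 2.0, 1.0, 0.01])
k90 = size_from_energy(X, 0.90)
k95 = size_from_energy(X, 0.95)
```

For this toy matrix the two leading singular values carry about 93% of the energy, so `k90` is 2 and `k95` is 3. As the surrounding discussion notes, such a per-class estimate ignores the other classes and is only a starting point for the search range.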

Two different actions can be taken to prevent this situation.
The first option is to reduce the size of this dictionary so that
it approximates mainly signals of its class and not from other
classes. This should be done carefully so that this dictionary
still identifies signals of its class better than other
dictionaries. The second option is to increase the sizes of other
dictionaries in order to overcome their misdetections. This
should also be done with caution since we might represent well
signals from other classes using these enlarged dictionaries.
Therefore, relying only on the spectrum analysis of the training
data is insufficient, because this method finds the size of each
dictionary independently from the other dictionaries. It ignores
the interrelations between dictionaries, while the classification
algorithm is based on those relations. Finding the optimal
$k$ values can be described by the following optimization
problem:

$$\arg\min_{k_1, \ldots, k_r} \sum_{1 \le i \le r} \; \sum_{\substack{1 \le j \le r \\ j \ne i}} C_{i,j,X}(k_i, k_j), \qquad \text{(V.1)}$$

where $C_{i,j,X}(k_i, k_j)$ is the number of signals from class $i$
in the dataset $X$ classified as belonging to class $j$ for the
respective dictionary sizes $k_i$ and $k_j$. The term, which we wish
to minimize in Eq. (V.1), is therefore the total number of wrong
classifications in our dataset $X$ when using a set of dictionaries
$D_1, D_2, \ldots, D_r$ with sizes $k_1, k_2, \ldots, k_r$, respectively.

We propose an algorithm for finding the dictionary sizes
by examining each specific pair of dictionaries separately,
and thus identifying the optimized dictionary sizes for this
pair. Then, the global $k$ values for all dictionaries will be
determined by finding an agreement between all the local
results. This process is described in Algorithm V.1.


Algorithm V.1: Dictionary Sizes Detection
Input: $X = \{X_1, X_2, \ldots, X_r\}$ training datasets for the $r$
classes; $K_{range}$ set of possible values of $k$ to search in.
Output: $K = \{k_1, k_2, \ldots, k_r\}$ dictionaries sizes.

1: for $i, j \in \{1, 2, \ldots, r\}$, $i < j$ do
     for $k_i, k_j \in K_{range}$ do
       $ERROR_{i,j}(k_i, k_j) \leftarrow C_{i,j,X}(k_i, k_j) + C_{j,i,X}(k_j, k_i)$

2: $K \leftarrow$ find_optimal_agreement($\{ERROR_{i,j}\}_{1 \le i < j \le r}$)


Algorithm V.1 examines each pair of classes $i$ and $j$ for
different $k$ values and produces the matrix $ERROR_{i,j}$, such
that the element $ERROR_{i,j}(s,t)$ is the number of classification
errors for those two classes, when the dictionary size of
class $i$ is $s$ and the dictionary size of class $j$ is $t$. This number
is the sum of signals from each class that were classified as
belonging to the other class. The matrix $ERROR_{i,j}$ reveals
the ranges of $k$ values for which the number of classification
errors is minimal. These are the ranges that fit when dealing
with a problem that contains only two classes of signals.
However, many classification problems need to deal with a
large number of classes. For this case, we create the ERROR
matrix for all possible pairs, find the $k$ ranges for each pair and
then find the optimal agreement between all pairs. The step
find_optimal_agreement describes this idea in Algorithm V.1.
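
The construction of one pairwise matrix $ERROR_{i,j}$ can be sketched as follows. To keep the example short and self-contained, the dictionaries are stand-ins built from the $k$ leading left singular vectors of each training matrix rather than from the randomized LU decomposition; this substitution, the helper names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def svd_dictionary(X, k):
    """Stand-in dictionary (NOT randomized LU): k leading left singular vectors."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def dist_columns(X, D):
    """dist(x, D) for every column x of X; D has orthonormal columns,
    so D @ D.T is the projection D D^+."""
    return np.linalg.norm(D @ (D.T @ X) - X, axis=0)

def pairwise_errors(Xi, Xj, k_range):
    """ERROR_ij(ki, kj) = C_ij(ki, kj) + C_ji(kj, ki) on the training data."""
    E = np.zeros((len(k_range), len(k_range)), dtype=int)
    for a, ki in enumerate(k_range):
        Di = svd_dictionary(Xi, ki)
        for b, kj in enumerate(k_range):
            Dj = svd_dictionary(Xj, kj)
            # class-i signals that are closer to the class-j dictionary
            c_ij = np.sum(dist_columns(Xi, Dj) < dist_columns(Xi, Di))
            # class-j signals that are closer to the class-i dictionary
            c_ji = np.sum(dist_columns(Xj, Di) < dist_columns(Xj, Dj))
            E[a, b] = c_ij + c_ji
    return E

# Toy data: class i lives in span{e1, e2}, class j in span{e3, e4} of R^6.
rng = np.random.default_rng(0)
Xi = np.zeros((6, 20)); Xi[:2] = rng.standard_normal((2, 20))
Xj = np.zeros((6, 20)); Xj[2:4] = rng.standard_normal((2, 20))
E = pairwise_errors(Xi, Xj, k_range=[1, 2])
```

Rows of `E` index the candidate sizes of the first class and columns those of the second. With well-separated toy classes and both sizes equal to 2, no training signal is misassigned.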


Finding this agreement can be done by making a list of
constraints for each pair and then finding $k$ values that satisfy
all the constraints and bring the minimal solution to the problem
described in Eq. (V.1). The constraints can bound from below or
above the size of a specific dictionary, or the relation between
the sizes of two dictionaries (for example, the dictionary of the
first class should have 10 more elements than the dictionary
of the second class). The step find_optimal_agreement is not
described here formally but is demonstrated in detail as part of
Algorithm V.1 in Section VI-B.
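
Since find_optimal_agreement is not specified formally, the following is only one possible reading of it: an exhaustive search that picks the size assignment minimizing the sum of the pairwise error matrices, i.e. the objective of Eq. (V.1) restricted to $K_{range}$. The function name `exhaustive_agreement`, the dictionary-of-matrices input format, and the toy numbers are hypothetical, and the search is only practical for a small number of classes and candidate sizes.

```python
from itertools import product
import numpy as np

def exhaustive_agreement(errors, r, k_range):
    """errors maps a class pair (i, j), i < j, to a matrix whose entry
    [a, b] is the pairwise error for sizes k_range[a] and k_range[b].
    Returns the size list with the smallest total error and that error."""
    best_idx, best_cost = None, None
    for idx in product(range(len(k_range)), repeat=r):
        cost = sum(E[idx[i], idx[j]] for (i, j), E in errors.items())
        if best_cost is None or cost < best_cost:
            best_idx, best_cost = idx, cost
    return [k_range[a] for a in best_idx], int(best_cost)

# Hypothetical error matrices for three classes and two candidate sizes.
k_range = [50, 60]
errors = {
    (0, 1): np.array([[5, 1], [3, 4]]),
    (0, 2): np.array([[2, 6], [7, 3]]),
    (1, 2): np.array([[4, 4], [1, 9]]),
}
sizes, total = exhaustive_agreement(errors, r=3, k_range=k_range)
```

For these made-up matrices the search selects sizes 50, 60 and 50 with a total of 4 pairwise errors. A constraint-based procedure, as described in the text, would prune this search instead of enumerating every combination.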

VI. Experimental Results 

In order to evaluate the performance of the dictionary
construction and classification algorithms in Section IV, Algorithm IV.2
was applied to a dataset that contains six different
file types. The goal is to classify each file or portion of a
file to the class that describes its type. This dataset consists of
1200 files that were collected in the wild using automated Web
crawlers. The files were equally divided into six types: PDF,
EXE, JPG, GIF, HTM and DOC. 100 files of each type were
chosen randomly as training datasets and the other 100 files
served for algorithm testing. In order to get results that reflect
the true nature of the problem, no restrictions were imposed
on the file collection process. Thus, some files contain only a
few kilobytes while others are several megabytes in size.
In addition, some of the PDF files contain pictures, which
make it hard for a content-based algorithm to classify the
correct file type. Similarly, DOC files may contain pictures
and the executables may contain text and pictures. Clearly,
these phenomena have a negative effect on the accuracy of the
results in this section. However, we chose to leave the dataset
in its original form.

Throughout this work, we came across several similar
works [?] that classify unknown files
to their type based on their content. None of these works made
their datasets publicly available for analysis and comparison
with other methods. We decided to publicize the dataset of files
that we collected to enable future comparisons. The details
about downloading and using the dataset can be obtained by
contacting one of the authors.

Three different scenarios were tested with the common goal
of classifying files or portions of files to their class type,
namely, assigning them to one of the six file types described
above. In each scenario, six dictionaries were learned that
correspond to the six file types. Then, the classification algorithm
(Algorithm IV.2) was applied to classify the type
of a test fragment or a file. The learning phase, which is
common to all scenarios, was done by applying Algorithm V.1
to find the dictionary sizes and Algorithm IV.1 to construct the
dictionaries. The testing phase varies according to the specific
goal of each scenario. Sections VI-A, VI-B and VI-C provide
a detailed description for each scenario and its classification
results.

A. Scenario A: Entire File is Analyzed 

In this scenario, we process a whole file and the extracted
features are taken from its entire content. The features are
the byte frequency distribution (BFD), which contains 256 features,
followed by the consecutive differences distribution (CDD), which
adds another 256 features. A total of 512 features are measured
for each training and testing file. CDD is used in addition
to BFD because the latter fails to capture any information
about byte ordering in the file. CDD turned out to be very
discriminative and improved the classification results. The
features extracted from each file were normalized by its size
since there are files of various sizes in the dataset. An example
of BFD construction is described in Fig. VI.1 and an example
of CDD construction is given in Fig. VI.2.


AABCCCDR

Byte    Probability (BFD)
A       0.25
B       0.125
C       0.375
D       0.125
R       0.125

Fig. VI.1. Byte Frequency Distribution (BFD) features extracted from the
file fragment “AABCCCDR”.


AABCCCDFG

Difference    Probability (CDD)
0             0.375
1             0.5
2             0.125

Fig. VI.2. Consecutive Differences Distribution (CDD) features extracted
from the file fragment “AABCCCDFG”. There are three consecutive pairs of
bytes with difference 0, four with difference 1 and one with difference 2.
These distributions are normalized to produce the shown probabilities. The
normalization factor is the length of the string minus one (8 in this example).
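
The two feature types illustrated in Figs. VI.1 and VI.2 can be computed with the short sketch below. The function names are hypothetical, and the CDD is taken here as the distribution of absolute differences between consecutive bytes, which is an assumption chosen because it yields exactly 256 bins and matches the figure.

```python
import numpy as np

def bfd(data: bytes) -> np.ndarray:
    """Byte frequency distribution: 256 relative byte frequencies."""
    b = np.frombuffer(data, dtype=np.uint8)
    return np.bincount(b, minlength=256) / len(data)

def cdd(data: bytes) -> np.ndarray:
    """Consecutive differences distribution: relative frequencies of
    |b[i+1] - b[i]| (assumed absolute difference), normalized by len - 1."""
    b = np.frombuffer(data, dtype=np.uint8).astype(np.int16)
    diffs = np.abs(np.diff(b))
    return np.bincount(diffs, minlength=256) / (len(data) - 1)

def file_features(data: bytes) -> np.ndarray:
    """512-dimensional feature vector: BFD followed by CDD."""
    return np.concatenate([bfd(data), cdd(data)])

bfd_example = bfd(b"AABCCCDR")
cdd_example = cdd(b"AABCCCDFG")
```

`bfd_example[ord("A")]` is 0.25 and `bfd_example[ord("C")]` is 0.375, as in Fig. VI.1, while `cdd_example[0]`, `cdd_example[1]` and `cdd_example[2]` are 0.375, 0.5 and 0.125, as in Fig. VI.2.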


This scenario can be useful when the entire tested file
is available for inspection. The training was done by the
application of Algorithms V.1 and IV.1 to the training data.
The $K_{range}$ parameter of Algorithm V.1 was determined
by the numerical rank of the training matrix. The possible
dictionary sizes need to be close to this rank in order
to represent their datasets well. The dictionary sizes
were 60 atoms per dictionary. The set of dictionaries $D =
\{D_{PDF}, D_{DOC}, D_{EXE}, D_{GIF}, D_{JPG}, D_{HTM}\}$ is the output
of Algorithm IV.1, which is later used for the classification of
test files. Each test file was analyzed using Algorithm VI.1
and classified to one of the six classes. The classification
results are presented as a confusion matrix in Table VI.1.
Each column corresponds to an actual file type and the rows
correspond to the file type classified by Algorithm VI.1. A
perfect classification produces a table with score 100 on the
diagonal and zero elsewhere. Our results are similar to those
achieved in [4] (Table II), which uses different methods. However,
we did not have the dataset used there, so there is no way
to perform a fair comparison.


Algorithm VI.1: File Content Dictionary Classification
Input: $x$ input file;
$D = \{D_{PDF}, D_{DOC}, D_{EXE}, D_{GIF}, D_{JPG}, D_{HTM}\}$ set
of dictionaries.

Output: $t_x$ file type predicted for $x$.

1: for $t \in \{PDF, DOC, EXE, GIF, JPG, HTM\}$ do
     $ERR_t \leftarrow dist(x, D_t)$

2: $t_x \leftarrow \arg\min_t \{ERR_t\}$


TABLE VI.1

Confusion matrix for Scenario A. 100 files of each type were
classified by Algorithm VI.1.

Classified   |          Correct File Type
File Type    |  PDF   DOC   EXE   GIF   JPG   HTM
-------------+------------------------------------
PDF          |   98     0     1     1     0     0
DOC          |    0    97     1     0     0     0
EXE          |    0     3    98     2     1     0
GIF          |    0     0     0    97     1     0
JPG          |    2     0     0     0    98     0
HTM          |    0     0     0     0     0   100


B. Scenario B: Fragments of a File 

In this scenario we describe a situation in which the entire
file is unavailable for the analysis and only some fragments
taken from random locations are available. The
goal is to classify the file type based on this partial information.
This serves a real application such as a firewall
that examines packets transmitted through a network or a
file being downloaded from a network server. This scenario
contains three experiments where different features were used
in each. The training phase, which is common to all three
experiments, includes extracting features from 10-kilobyte
fragments that belong to the training data. These features serve
as an input to Algorithm IV.1, which produces the dictionaries
for the classification phase. The second parameter in Algorithm IV.1
is a set of dictionary sizes, which were determined
by Algorithm V.1. We use the first set of features in this
scenario (described hereafter) to demonstrate in more depth how
Algorithm V.1 works. The sizes of six dictionaries need to be
determined based on the agreement between the pairwise error
matrices. Fig. VI.3 shows the matrices $ERROR_{PDF,JPG}$ and
$ERROR_{PDF,EXE}$.

Fig. VI.3(a) describes the number of classification errors
for the PDF and JPG types as a function of the respective
dictionary sizes. It can be observed that there is a large number
of errors for many size pairs, suggesting that the PDF and JPG
dictionaries exhibit a high measure of similarity. This property
makes the distinction between these two types a hard task. A
closer look at Fig. VI.3(a) enables us to find the optimal sizes
for those dictionaries by making the following observations.
Only a few values in the cells above the main diagonal provide
good results for this pair. Additionally, the JPG dictionary
should have 10 atoms more than the PDF dictionary. It can also
be learned that both dictionary sizes should be greater than 50
atoms.

The PDF and EXE error values in Fig. VI.3(b) indicate that
these dictionaries are well separated. There is a large set of
dictionary sizes near the diagonal for which the classification
error is low. The following intuition helps to understand why
a large range of low errors will achieve better classification
results. The error matrices are built from the training data X
and represent the classification error
C_{PDF,JPG,X} + C_{JPG,PDF,X} of the algorithm when it is
applied to this data (see Eq. IV.1 when using 2 sets). The best
k values from the ERROR matrix fit the training data, in the
sense that a PDF training signal will be represented more
accurately by a PDF dictionary of size k_PDF than by a JPG
dictionary of size k_JPG. However, this is not necessarily the
case for a PDF test signal, which may need a larger PDF
dictionary or a smaller JPG dictionary in order to be classified
correctly. This might happen because many PDF-dictionary atoms
are irrelevant for reconstructing this signal while too many
JPG-dictionary atoms are relevant for it. This means that from
this signal's perspective, the PDF dictionary size is smaller
than k_PDF and the JPG dictionary is larger than k_JPG. In terms
of Fig. VI.3, which shows the classification errors for the two
discussed pairs, this means moving away from the diagonal (which
has the best dictionary sizes for the training set). In the
JPG-PDF case, this shift will increase the classification error
because all the off-diagonal entries in Fig. VI.3(a) have higher
error counts. On the other hand, there is a low probability of
getting a classification error in Fig. VI.3(b) because there are
many off-diagonal options for dictionary sizes that will generate
a low error. The pair JPG-PDF is therefore more sensitive to
noise than the pair EXE-PDF. This observation is supported by
the confusion matrix of the first experiment, as shown in
Table VI.2.
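
The pairwise error matrix discussed above can be illustrated with a short Python sketch. It is only a sketch under several assumptions that are not taken from the paper: signals are stored as matrix columns, the distance is the least-squares reconstruction residual, and the hypothetical build_dictionary uses leading singular vectors as a stand-in for the randomized LU construction of Algorithm IV.1. Entry E[i, j] counts the training signals of each type that lie closer to the other type's dictionary, for one pair of dictionary sizes.

```python
import numpy as np

def dist(x, D):
    """Assumed distance: least-squares reconstruction residual of x on D."""
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return float(np.linalg.norm(D @ coef - x))

def build_dictionary(X, k):
    # Stand-in builder (assumption): the k leading left singular vectors of X.
    # The paper builds its dictionaries with the randomized LU decomposition.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def pairwise_error_matrix(X_a, X_b, sizes):
    """E[i, j] = (# columns of X_a closer to D_b) + (# columns of X_b closer to D_a)."""
    E = np.zeros((len(sizes), len(sizes)), dtype=int)
    for i, k_a in enumerate(sizes):
        D_a = build_dictionary(X_a, k_a)
        for j, k_b in enumerate(sizes):
            D_b = build_dictionary(X_b, k_b)
            a_as_b = sum(dist(x, D_b) < dist(x, D_a) for x in X_a.T)
            b_as_a = sum(dist(x, D_a) < dist(x, D_b) for x in X_b.T)
            E[i, j] = a_as_b + b_as_a
    return E

# Toy data: two well-separated classes living in orthogonal subspaces of R^6.
rng = np.random.default_rng(0)
X_a = np.zeros((6, 20))
X_a[:2] = rng.normal(size=(2, 20))
X_b = np.zeros((6, 20))
X_b[2:4] = rng.normal(size=(2, 20))
E = pairwise_error_matrix(X_a, X_b, sizes=[1, 2])
```

For well-separated toy classes like these, the matrix is low over the whole size grid, which mirrors the PDF-EXE situation; two similar classes would instead leave only a few low-error cells, as in the PDF-JPG case.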

In the first experiment, the dictionary sizes, which were
determined by Algorithm IV.2, are 150 atoms for each of the
PDF, DOC, EXE, GIF and HTM dictionaries and 160 atoms for the
JPG dictionary. 10 fragments of 1500 bytes each were sampled
randomly from each examined file. BFD and CDD based features
were extracted from each fragment and then normalized by the
fragment size (similarly to the normalization by file size
conducted in Scenario A in Section VI-A). Then, the distance
between each fragment and each of the six dictionaries was
calculated. The mean value of the distances was computed for
each dictionary. Finally, the examined file was assigned to
the class that has the minimal mean value. This procedure
is described in Algorithm VI.2. The classification results are
presented in Table VI.2.
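
The fragment-based procedure described above (sample fragments, extract normalized features, average the distances per dictionary, pick the minimum) can be sketched in Python. The sketch makes several assumptions: only the byte frequency distribution (BFD) is used as the feature, since the CDD features are defined outside this section; dist is taken to be the least-squares reconstruction residual; and all function names and the toy dictionaries are hypothetical.

```python
import numpy as np

def sample_fragments(data, n_fragments=10, size=1500, rng=None):
    """Cut n_fragments pieces of `size` bytes from random offsets of `data`."""
    if rng is None:
        rng = np.random.default_rng()
    last_start = max(len(data) - size, 0)
    starts = rng.integers(0, last_start + 1, size=n_fragments)
    return [data[s:s + size] for s in starts]

def bfd(fragment):
    """Byte frequency distribution, normalized by the fragment size."""
    counts = np.bincount(np.frombuffer(fragment, dtype=np.uint8), minlength=256)
    return counts / max(len(fragment), 1)

def dist(x, D):
    """Assumed distance: least-squares reconstruction residual of x on D."""
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return float(np.linalg.norm(D @ coef - x))

def classify_by_fragments(fragments, dictionaries, featurize=bfd):
    """Average the per-fragment distances for each type and return the minimizer."""
    feats = [featurize(f) for f in fragments]
    means = {t: float(np.mean([dist(x, D) for x in feats]))
             for t, D in dictionaries.items()}
    return min(means, key=means.get), means

# Toy dictionaries in the 256-dimensional BFD space (purely illustrative).
D_text = np.zeros((256, 2))
D_text[ord("a"), 0] = 1.0
D_text[ord("b"), 1] = 1.0
D_bin = np.zeros((256, 2))
D_bin[0, 0] = 1.0
D_bin[255, 1] = 1.0

data = b"ab" * 2000
fragments = sample_fragments(data, rng=np.random.default_rng(0))
label, means = classify_by_fragments(fragments, {"TEXT": D_text, "BIN": D_bin})
```

Averaging over several fragments makes the decision less sensitive to a single unrepresentative fragment, which is the motivation for using 10 fragments per file.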



Fig. VI.3. Error matrices produced by Algorithm IV.2: (a) error
matrix for the pair PDF-JPG; (b) error matrix for the pair
PDF-EXE. The matrices are presented in a cold-to-hot colormap
to show ranges of low (blue) and high (red) errors.


TABLE VI.2

Confusion matrix for Scenario B where BFD+CDD based
features were chosen. 100 files of each type were classified by
Algorithm VI.2.

Classified            Correct File Type
File Type      PDF   DOC   EXE   GIF   JPG   HTM
PDF             93     0     2     0    14     0
DOC              0    96     2     0     0     0
EXE              0     4    95     0     0     0
GIF              0     0     0   100     2     0
JPG              6     0     0     0    82     0
HTM              1     0     1     0     2   100


The second experiment used a double-byte frequency
distribution (DBFD), which contains 65536 features. Figure VI.4
demonstrates the DBFD feature extraction from a small file
fragment.


Fragment: AABCCC

Double-Byte      Probability (DBFD)
AA               0.2
AB               0.2
BC               0.2
CC               0.4
(other pairs)    0

Fig. VI.4. Features extracted from the file fragment "AABCCC" using
Double Byte Frequency Distribution (DBFD). The normalization factor is
the length of the string minus one.
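
The DBFD extraction illustrated in Fig. VI.4 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name dbfd and the layout of the 65536-long vector (entry 256*a + b holds the pair of byte values a, b) are assumptions, not details given in the text.

```python
import numpy as np

def dbfd(fragment: bytes) -> np.ndarray:
    """Relative frequency of every overlapping byte pair in the fragment.

    Assumed layout: entry 256*a + b corresponds to the byte pair (a, b).
    The counts are normalized by the number of pairs, len(fragment) - 1.
    """
    data = np.frombuffer(fragment, dtype=np.uint8).astype(np.int64)
    if len(data) < 2:
        return np.zeros(65536)
    pair_index = 256 * data[:-1] + data[1:]
    counts = np.bincount(pair_index, minlength=65536)
    return counts / (len(data) - 1)

def pair(s: bytes) -> int:
    """Index of a two-byte pair in the assumed layout."""
    return 256 * s[0] + s[1]

features = dbfd(b"AABCCC")
p_aa = features[pair(b"AA")]
p_cc = features[pair(b"CC")]
```

For the fragment "AABCCC" there are five overlapping pairs, so each single occurrence contributes 1/5 and the pair CC, which occurs twice, contributes 2/5, in agreement with the figure.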


Similarly to the first experiment, 10 fragments were sampled
from random locations in each examined file. However, this
time we used 2000 bytes per fragment since smaller fragment
sizes do not capture sufficient information when DBFD
features are used. The feature vectors were normalized by
the fragment's size as before. Algorithm VI.2 was applied
to classify the type of each examined file. The dictionary
sizes in this experiment are 80 atoms for each of PDF, DOC and
JPG and 60 atoms for each of EXE, GIF and HTM. The
classification results of this experiment are presented in
Table VI.3. We see that DBFD based features reveal patterns in
the data that were not revealed by using BFD and CDD based
features. In particular, they capture very well GIF files that
BFD and CDD based features fail to capture.

Algorithm VI.2: File fragment classification using dictionary
learning
Input:  X = {x_1, x_2, ..., x_r} input fragments;
        D = {D_PDF, D_DOC, D_EXE, D_GIF, D_JPG, D_HTM} set
        of dictionaries.
Output: t_X file type predicted for X.

1: for i ← 1, ..., r do
       for t ∈ {PDF, DOC, EXE, GIF, JPG, HTM} do
           ERR_{i,t} ← dist(x_i, D_t)
2: for t ∈ {PDF, DOC, EXE, GIF, JPG, HTM} do
       MEAN_t ← mean{ERR_{i,t}}, i = 1, ..., r
3: t_X ← argmin_t {MEAN_t}

TABLE VI.3

Confusion matrix for Scenario B that is based on DBFD based
features. 100 files of each type were classified by
Algorithm VI.2.

Classified            Correct File Type
File Type      PDF   DOC   EXE   GIF   JPG   HTM
PDF             92     0     2     0     5     1
DOC              2    97     2     0     5     0
EXE              3     1    88     2     0     0
GIF              1     1     5    98     0     0
JPG              1     1     2     0    90     0
HTM              1     0     1     0     0    99


of a previous byte. This is well suited to file types such as
EXE, where similar addresses and opcodes are used repeatedly.
Each memory address or opcode comprises two or more bytes;
therefore, it can be described by the transition
probability between these bytes. Text files are also a
good example of the applicability of MW based features
because it is well known that natural language can be described
by patterns of transition probabilities between words or letters.
Our study shows that MW based features also capture the
structure of media files like GIF and HTM files. The relatively
unsatisfactory performance on JPG files arises because our PDF
dictionary was trained on PDF files that contain pictures.
Therefore, it detected some of the JPG files. The prediction
accuracy is described in Table VI.4. Those results (97% avg.
accuracy) outperform the results obtained by the BFD+CDD
and DBFD features. They also improve on all the surveyed
methods in [4] (Table VI), including the algorithm proposed
in [4], which has 85.5% average accuracy. However, it should
be noted that we used 10 fragments for the classification of
each file whereas in [4] a single fragment is used. For the MW
features in Scenario B, the dictionary sizes are 500 atoms each
for PDF, DOC and EXE files, 600 for GIF files, 800 for JPG
files and 220 for HTM files. The HTM dictionary is smaller than
the other dictionaries because the HTM training set contains
only 230 samples, and the LU dictionary size is bounded by
the dimensions of the training matrix (see Algorithm III.1).
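
As a worked check of the reported average accuracy, the per-class accuracy can be computed from the confusion matrix of the MW experiment. The sketch below is only an illustration; the matrix entries are the values as read from Table VI.4 (rows are classified types, columns are correct types, in the order PDF, DOC, EXE, GIF, JPG, HTM).

```python
import numpy as np

# Confusion matrix as read from Table VI.4 (rows: classified, columns: correct).
confusion = np.array([
    [93,  1,  0,  0,  9,   0],   # classified as PDF
    [ 0, 98,  0,  0,  0,   0],   # classified as DOC
    [ 2,  0, 98,  1,  0,   0],   # classified as EXE
    [ 3,  1,  1, 99,  0,   0],   # classified as GIF
    [ 1,  0,  0,  0, 91,   0],   # classified as JPG
    [ 1,  0,  1,  0,  0, 100],   # classified as HTM
])

files_per_type = confusion.sum(axis=0)            # should be 100 for every type
per_class_accuracy = confusion.diagonal() / files_per_type
average_accuracy = per_class_accuracy.mean()
```

The mean of the diagonal is 96.5%, which is consistent with the quoted average of about 97% after rounding.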


TABLE VI.4

Confusion matrix for Scenario B using MW based features. 100
files of each type were classified by Algorithm VI.2.

Classified            Correct File Type
File Type      PDF   DOC   EXE   GIF   JPG   HTM
PDF             93     1     0     0     9     0
DOC              0    98     0     0     0     0
EXE              2     0    98     1     0     0
GIF              3     1     1    99     0     0
JPG              1     0     0     0    91     0
HTM              1     0     1     0     0   100


The third experiment defines a Markov-walk (MW) like
set of 65536 features extracted from the dataset for each
signal. The transition probability between each pair of bytes is
calculated. Figure VI.5 demonstrates how to extract MW type
features from a file fragment.


Fragment: AABCCCF

Transition             Probability (MW)
A → A                  0.5
A → B                  0.5
B → C                  1
C → C                  0.66
C → F                  0.33
(other transitions)    0

Fig. VI.5. Markov Walk (MW) based features extracted from the file
fragment "AABCCCF".
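
The MW features of Fig. VI.5 can be sketched in Python as a row-normalized table of byte-pair counts. This is a minimal sketch under assumptions: the function name markov_walk is hypothetical, and the 65536 features are assumed to be the flattened 256 x 256 transition matrix, with rows indexed by the preceding byte.

```python
import numpy as np

def markov_walk(fragment: bytes) -> np.ndarray:
    """Flattened 256 x 256 matrix of transition probabilities P(next byte | byte)."""
    data = np.frombuffer(fragment, dtype=np.uint8).astype(np.int64)
    counts = np.zeros((256, 256))
    if len(data) >= 2:
        # Count every overlapping transition data[i] -> data[i + 1].
        np.add.at(counts, (data[:-1], data[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of bytes that never occur as a predecessor stay all-zero.
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return probs.ravel()

P = markov_walk(b"AABCCCF").reshape(256, 256)
A, B, C, F = ord("A"), ord("B"), ord("C"), ord("F")
p_cc = P[C, C]
p_cf = P[C, F]
```

In "AABCCCF" the byte C is followed twice by C and once by F, giving 2/3 and 1/3, which the figure shows rounded to 0.66 and 0.33. Unlike DBFD, each row is normalized by the count of its own leading byte rather than by the fragment length.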


Both MW based features and DBFD based features are
calculated using the double byte frequencies, but they capture
different information from the data. DBFD based features are
focused on finding pairs of bytes that are most prevalent and
those that have low chances of appearing in a file. On the
other hand, MW based features represent the probability that
a specific byte will appear in the file given the appearance


C. Scenario C: Detecting Execution Code in PDF Files 

PDF is a common file format that can contain different
media elements such as text, fonts, images, vector graphics
and more. This format is widely used on the Web because
it is self-contained and platform independent. While the
PDF format is considered to be safe, it can contain any file
format, including executables such as EXE files and various
script files. Detecting malicious PDF files can be challenging
as it requires a deep inspection of every file fragment that
can potentially hide executable code segments. The embedded
code is not automatically executed when the PDF is being
viewed using a PDF reader since it first requires exploiting a
vulnerability in the viewer code or in the PDF format. Still,
detecting such a potential threat can lead to a preventive action
by the inspecting system.

To evaluate how effective our method can be in detecting
executable code embedded in PDF files, we generated several
PDF files which contain text, images and executable code.
We used four datasets of PDF files as our training data:

X_PDF:        100 PDF files containing mostly text.
X_GIFinPDF:   100 PDF files containing GIF images.
X_JPGinPDF:   100 PDF files containing JPG images.
X_EXEinPDF:   100 PDF files containing EXE files.


All the GIF, JPG and EXE files were taken from previous
experiments and were embedded into the PDF files. We
generated 4 dictionaries, one for each dataset, using
Algorithm IV.1. The input for the algorithm was

X = {X_PDF, X_GIFinPDF, X_JPGinPDF, X_EXEinPDF}.

We then created a test dataset which consisted of 100 regular
PDF files and 10 PDF files that contain executable code.
Algorithm VI.2 classified the 110 files. The input fragments
X were the PDF file fragments. The input set of dictionaries

D = {D_PDF, D_GIFinPDF, D_JPGinPDF, D_EXEinPDF}

was the output of Algorithm IV.1. A file is classified as
malicious (contains executable code) if we find more than
T_EXE fragments of type EXE inside; otherwise it is classified
as a safe PDF file. We used T_EXE = 10 as our threshold
since it minimized the total number of misclassifications.
The training step was applied to 10-kilobyte fragments and
the classification step was applied to five-kilobyte fragments.
We used the MW based features (65,536 extracted features).
By using Algorithm VI.2 we managed to detect all 10
malicious PDF files with an 8% false alarm rate (8 safe PDF
files were classified as malicious). The results are
summarized in Table VI.5.
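
The thresholding rule used in this scenario can be written as a short Python sketch. It assumes that each fragment of a file has already been assigned the label of its nearest dictionary; the function name is_malicious and the label strings are hypothetical, while the threshold of 10 and the strict "more than" comparison follow the description above.

```python
def is_malicious(fragment_labels, t_exe=10, exe_label="EXEinPDF"):
    """Flag a PDF when more than t_exe of its fragments are labeled as EXE."""
    n_exe = sum(1 for label in fragment_labels if label == exe_label)
    return n_exe > t_exe

# Toy label sequences for two hypothetical files.
suspicious_file = ["PDF"] * 30 + ["EXEinPDF"] * 11
borderline_file = ["PDF"] * 30 + ["EXEinPDF"] * 10

flag_suspicious = is_malicious(suspicious_file)
flag_borderline = is_malicious(borderline_file)
```

Raising the threshold lowers the false alarm rate at the cost of missing files with only a small embedded executable, so in practice it would be tuned on labeled data, as was done here with T_EXE = 10.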


TABLE VI.5

Confusion matrix for malicious PDF detection experiment. 110
files were classified by Algorithm VI.2.

Classified              Correct File Type
File Type          PDF    Malicious PDF
Safe PDF            92         0
Malicious PDF        8        10


Other file formats, which contain embedded data (DOC files 
for example), can be classified in the same way. 


D. Time Measurements 

Computer security software frequently faces the situations
described in Sections VI-A to VI-C. Therefore, any solution
to file type classification must provide a quick response
to queries. We measured the time required for both the
training phase and the classification phase of our method that
classifies a file or a fragment of a file. Since the training
phase operates offline, it does not need to be fast. On the
other hand, a classification query should be fast for real-time
considerations and for high-volume applications. Tables VI.6
and VI.7 describe the execution time in Scenarios A
(Section VI-A) and B (Section VI-B), respectively. The times are
divided into a preprocessing step and the actual analysis
step. The preprocessing includes feature extraction from files
(data preparation) and loading this data into Matlab. The
feature extraction was done in Python and the output files
were loaded into Matlab. Obviously, this is not an optimal
configuration as it involves intensive, slow disk I/O. We did
not optimize these steps. We note that the computation time
of the dictionary sizes is not included in the tables, because
this is a meta-parameter of Algorithm IV.1 which can be computed
in different ways, based on the application. The analysis time
refers to the time needed by Algorithm IV.1 to build six
dictionaries (left column in each table) and to classify a single
file to one of the six classes (right column). The classification
was performed by Algorithm VI.1 in Scenario A (Table VI.6)
and by Algorithm VI.2 in Scenario B (Table VI.7). All training
and classification times are normalized by the data size, which
allows evaluation of the algorithm performance regardless of
actual file sizes (which vary widely). The classification time of
Scenario B is not normalized because Algorithm VI.2 does not
depend on the input file size (it samples the same amount
of data from each file, regardless of its size). Our
classification process is fast: the analysis step takes 0.0005
seconds per MB in Scenario A and between 0.01 and 0.27 seconds
per file in Scenario B. The preprocessing step can be further
optimized for real-time applications. All the experiments were
conducted on a Windows 64-bit machine with an Intel i7 2.93 GHz
CPU and 8 GB of RAM.


TABLE VI.6

Running times for Scenario A.

Features   Step            Training time (sec)   Classification time (sec)
                           per 1 MB of data      per 1 MB of data
BFD+CDD    Preprocessing   1.8                   1.88
           Analysis        0.004                 0.0005
           Total           1.804                 1.8805


TABLE VI.7

Running times for Scenario B.

Features   Step            Training time (sec)   Classification time
                           per 1 MB of data      (sec)
BFD+CDD    Preprocessing   1.93                  0.1 (per 1 MB)
           Analysis        0.008                 0.01 (per file)
           Total           1.938
DBFD       Preprocessing   13.78                 1.6 (per 1 MB)
           Analysis        0.54                  0.26 (per file)
           Total           14.32
MW         Preprocessing   18.42                 2.41 (per 1 MB)
           Analysis        0.65                  0.27 (per file)
           Total           19.07



VII. Conclusion 

In this work, we presented a novel algorithm for dictionary
construction, which is based on a randomized LU decomposition.
By using the constructed dictionary, the algorithm classifies
the content of a file and can deduce its type by examining
a few file fragments. The algorithm can also detect anomalies
in PDF files (or any other rich content formats) which can be
malicious. This approach can be applied to detect suspicious
files that can potentially contain a malicious payload.
Anti-virus systems and firewalls can therefore analyze and
classify PDF files using the described method and block
suspicious files.
The usage of dictionary construction and classification in our
algorithm is different from other classical methods for file
content detection, which use statistical methods and pattern
matching in the file header for classification via deep packet
inspection. The fast dictionary construction allows rebuilding
the dictionary from scratch when it is out of date, which
is important when building evolving systems that classify
continuously changing data.